Abstract

Background: Systematic reviews are a cornerstone of evidence-based medicine but are useful only if up-to-date. 
Methods for detecting signals of when a systematic review needs updating have face validity, but no proposed 
method has had an assessment of predictive validity performed. 

Methods: The AHRQ Comparative Effectiveness Review program had produced 13 comparative effectiveness 
reviews (CERs), a subcategory of systematic reviews, by 2009, 11 of which were assessed in 2009 using a 
surveillance system to determine the degree to which individual conclusions were out of date and to assign a 
priority for updating each report. Four CERs were judged to be a high priority for updating, four CERs were judged 
to be medium priority for updating, and three CERs were judged to be low priority for updating. AHRQ then 
commissioned full update reviews for 9 of these 11 CERs. Where possible, we matched the original conclusions 
with their corresponding conclusions in the update reports, and compared the congruence between these pairs 
with our original predictions about which conclusions in each CER remained valid. We then classified the 
concordance of each pair as good, fair, or poor. We also made a summary determination of the priority for 
updating each CER based on the actual changes in conclusions in the updated report, and compared these 
determinations with the earlier assessments of priority. 

Results: The 9 CERs included 149 individual conclusions, 84% with matches in the update reports. Across reports, 
83% of matched conclusions had good concordance, and 99% had good or fair concordance. The one instance of 
poor concordance was partially attributable to the publication of new evidence after the surveillance signal 
searches had been done. Both CERs originally judged as being low priority for updating had no substantive 
changes to their conclusions in the actual updated report. The agreement on overall priority for updating between 
prediction and actual changes to conclusions was Kappa = 0.74. 

Conclusions: These results provide some support for the validity of a surveillance system for detecting signals 
indicating when a systematic review needs updating. 
Background 

Systematic reviews are a cornerstone of evidence-based 
care, either by themselves or through their incorporation 
into practice guidelines, performance measures or other 
evidence-based practice. To be useful, however, systematic 
reviews need to be up-to-date. 

The science of determining when systematic reviews 
need updating has been developing for the past decade. 
Prior to 2001, no method or criterion existed to 
determine whether evidence-based products remained 
valid or whether the evidence underlying them had been 
superseded by newer work. Since then, several groups 
have begun developing methods to determine signals for 
updating reviews [1-5]. Most methods involve some 
form of limited literature searches and the use of expert 
opinion, although some methods use statistical methods 
and are applicable only to meta-analytic results [6,7]. 
Two of these methods have been formally compared and 
found to produce similar results [2]. To date, however, 
no method has been assessed for predictive validity, 
meaning there is no way of determining whether the 
presence or absence of signals does in fact predict 
whether the review is out-of-date. In addition to the 
more easily assessed situation of a false-positive (that is, 
a signal that detects that a review is out-of-date, but the 
subsequent update does not result in any important 
changes in the conclusions), such a study requires being 
able to assess for false-negatives, which requires updating 
reviews for which no signals are detected. In 2008, we 
were asked to determine which of 11 systematic reviews 
sponsored by the Agency for Healthcare Research and 
Quality (AHRQ) Comparative Effectiveness Review (CER) 
program might be in need of updating. We took advantage 
of a natural experiment to assess the predictive validity of 
our method for assessing for signals for updating. 

Methods 

In this study, we assessed the predictive validity of sig- 
nals for updating CERs detected in 2009 that have since 
been updated. We start with a description of the original 
process used to detect signals [3] and then describe how 
we assessed the validity of the signals. This original 
process subsequently evolved to the process described 
by Ahmadzai et al. [8]; the two are nearly identical. 

The 2009 method for detecting signals 
Identifying new evidence from published studies 

Search strategy. We started by using the search strategy 
employed in the original report. However, we limited the 
search (which included at least MEDLINE/PubMed and/ 
or Cochrane Reviews, as well as, on a topic-specific 
basis, additional databases) to five top-rated general 
interest medical journals (Annals of Internal Medicine, 
British Medical Journal, Journal of the American Medical 
Association, The Lancet and New England Journal of 
Medicine) and the specialty journals most relevant to the 
topic. The specialty journals were those most highly represented 
among the references from the original report 
(four to six specialty journals). We also modified the key 
terms if, for example, we were aware of new drugs for 
the condition, adding their names to the search terms. 

Table 1 Criteria for determining signals for updating 

Label Indications for the need for an update 



Search inception dates were 6 to 12 months prior to the 
end date of the original CER search in order to ensure 
overlap between the searches. 
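The restricted search described above can be sketched in code. The following is a minimal illustration, not the search tooling used in the study: the helper names (`months_before`, `build_query`), the PubMed-style field tags (`[Journal]`, `[dp]`), the open-ended upper date bound `"3000"`, and the way journal names are spelled in the query are all assumptions made for the example.

```python
from datetime import date

# The five general-interest journals named in the text.
GENERAL_JOURNALS = [
    "Annals of Internal Medicine",
    "British Medical Journal",
    "Journal of the American Medical Association",
    "The Lancet",
    "New England Journal of Medicine",
]


def months_before(d, months):
    """Return the first day of the month lying `months` months before d."""
    total = d.year * 12 + (d.month - 1) - months
    return date(total // 12, total % 12 + 1, 1)


def build_query(topic_terms, specialty_journals, original_search_end,
                overlap_months=6):
    """Compose a PubMed-style query limited to selected journals.

    The search starts `overlap_months` before the end date of the
    original CER search so that the two searches overlap.
    """
    if not 6 <= overlap_months <= 12:
        raise ValueError("the text describes an overlap of 6 to 12 months")
    start = months_before(original_search_end, overlap_months)
    journals = GENERAL_JOURNALS + list(specialty_journals)
    journal_clause = " OR ".join(f'"{name}"[Journal]' for name in journals)
    topic_clause = " OR ".join(topic_terms)
    # Assumed syntax: an open-ended publication-date range.
    date_clause = f'"{start:%Y/%m/%d}"[dp] : "3000"[dp]'
    return f"({topic_clause}) AND ({journal_clause}) AND ({date_clause})"


# Hypothetical usage: topic terms, specialty journal and date are invented.
query = build_query(
    ["osteoarthritis", "etoricoxib"],
    ["Arthritis and Rheumatism"],
    date(2005, 11, 30),
)
```

In practice the topic clause would be the full search strategy of the original report, with any newly added drug names appended to it.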

Study selection and extraction. Using the same general 
inclusion and exclusion criteria as the original CER, a 
single reviewer experienced in systematic reviews con- 
ducted a screening of the titles and abstracts and re- 
quested any articles deemed relevant to the topic. From 
among those articles, the reviewer extracted relevant data 
from articles that met the inclusion criteria and then con- 
structed an evidence table. These data included study- 
level details extracted in the original CER (for example, 
sample size, study design, and outcomes measured) as well 
as the outcomes themselves. 

Identifying new evidence from experts and expert opinion. 
For each topic, we created a questionnaire matrix that listed 
the key questions and conclusions from the original execu- 
tive summary. The matrix was sent to experts in the field, 
including the original project leader, technical expert panel 
members and peer reviewers. The experts were asked to 
indicate whether each conclusion listed in the matrix was, 
to their knowledge, still valid and, if not, to describe any 
new evidence and provide citations. 

Assessing individual conclusions for signals. Once ab- 
straction of the study conditions and findings for each 
new included study was completed and expert opinions 
were received, we assessed, on a conclusion-by-conclusion 
basis, whether the new findings provided a signal for the 
need for an update. Table 1 lists the criteria used for mak- 
ing these determinations [9]. 
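To make the decision logic of Table 1 concrete, the sketch below encodes it as a small function. It is an illustration only: in the study these were judgments made by reviewers, and the evidence levels (`"none"`, `"confirmatory"`, `"some"`, `"substantial"`, `"prima_facie"`), the function name, and the treatment of an exact 50% split of experts as a minority are assumptions introduced here.

```python
def classify_signal(new_evidence, experts_flagging, experts_responding):
    """Map search findings and expert responses to a Table 1 label.

    new_evidence: one of "none", "confirmatory", "some", "substantial",
        "prima_facie" (hypothetical coding of the literature findings).
    experts_flagging: number of responding experts who cited new evidence
        that might change the conclusion.
    experts_responding: number of experts who responded.
    """
    levels = {"none", "confirmatory", "some", "substantial", "prima_facie"}
    if new_evidence not in levels:
        raise ValueError(f"unknown evidence level: {new_evidence!r}")
    if not 0 <= experts_flagging <= experts_responding:
        raise ValueError("experts_flagging must be between 0 and experts_responding")

    # Prima facie evidence, such as a market withdrawal or a black box warning.
    if new_evidence == "prima_facie":
        return "Out of date"
    # Assumption: a majority means strictly more than half of respondents.
    majority = experts_flagging * 2 > experts_responding
    if new_evidence == "substantial" or majority:
        return "Probably out of date"
    if new_evidence == "some" or experts_flagging > 0:
        return "Possibly out of date"
    # No new or only confirmatory evidence, and no expert raised a concern.
    return "Still valid"
```

For example, `classify_signal("confirmatory", 1, 5)` returns `"Possibly out of date"`, because a single dissenting expert is enough to move a conclusion out of the "still valid" category under these criteria.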

For each CER, we constructed a summary table that 
included the following for each key question: original 
conclusions, findings of the new literature search, sum- 
mary of expert assessment, our final assessment of the 
currency of the conclusions, and the priority for updating. 

Determining priority for updating a CER. We needed 
to make an overall judgment regarding the priority for 
updating an entire CER. This determination rested on two 
criteria. (1) How much of the CER is possibly, probably or 
certainly out-of-date? (2) How out-of-date is that portion 
of the CER? For example, we asked whether the potential 
changes to the conclusions would involve only refinement 
of original estimates or whether the potential changes 
would include the finding that some therapies are no 
longer favored or might no longer be in use. Another 
question was whether the portion of the CER that was 
probably or certainly out-of-date involved an issue of 
safety (for example, a drug withdrawn from the market, 
a US Food and Drug Administration black box warning) 
or the availability of a new drug within an existing class, 
with the latter being a less important signal to update 
than the former. This final determination was a global 
judgment made by all the individuals working on each 
particular CER. On the basis of that determination, we 
classified CERs as being of low, medium or high priority 
for updating. For high-priority updates, we also provided 
our rationale. 

| Label | Indications for the need for an update |
|---|---|
| Still valid | Original conclusion is still valid and this portion of the original report does not need updating. This conclusion was reached if we found no new evidence or only confirmatory evidence and all responding experts assessed the CER conclusion as still valid. |
| Possibly out of date | Original conclusion is possibly out of date and this portion of the original report may need updating. This conclusion was reached if we found some new evidence that might change the CER conclusion, and/or a minority of responding experts assessed the CER conclusion as having new evidence that might change the conclusion. |
| Probably out of date | Original conclusion is probably out of date and this portion of the original report may need updating. This conclusion was reached if we found substantial new evidence that might change the CER conclusion, and/or a majority of responding experts assessed the CER conclusion as having new evidence that might change the conclusion. |
| Out of date | Original conclusion is out of date. This conclusion was reached if we found new evidence that rendered the CER conclusion out of date or no longer applicable. Recognizing that our literature searches were limited, we reserved this category only for situations where a limited search would produce prima facie evidence that a conclusion was out of date, such as the withdrawal of a drug or surgical device from the market, a black box warning from FDA, etc. |

Assessment of predictive validity 

Our 2009 work assessed 11 CERs. We classified four as 
having a high priority for updating, four as having a 
medium priority for updating and three as having a low 
priority for updating (see Table 2). One of the low-priority 
topics, comparative effectiveness of percutaneous coronary 
interventions and coronary artery bypass grafting for cor- 
onary artery disease, was considered a low priority for an 
update because AHRQ had already commissioned an indi- 
vidual patient data meta-analysis, which it considered to 
be an update of the CER and was published in 2009 [10]. 

AHRQ elected to support full updates of all of the 
remaining CERs except the report on clinically localized 
prostate cancer, for which they believed it would be 
prudent to wait for the pending PIVOT trial results [22]. 
This situation presented us with a natural experiment. 
Because all of the reports, regardless of update priority 
status, were going to get the gold standard of a complete 
update, we could assess for both false-positives (reports 
classified as high priority but having no major change in 
conclusions when updated) and false-negatives (reports 
classified as low priority that, when updated, had major 
changes in conclusions) based on the 2009 predictions. 
To do this experiment, we took each conclusion from the 
original CER and then tried to match it with the closest 
similar conclusion from the update. We then assessed the 
degree of concordance between the 2009 prediction and 
the updated conclusion. We used the criteria described 
below. 

1. Good: Concordance was considered good if the 
original prediction was "still valid" and there was no 
new relevant evidence or if new evidence continued 
to support the conclusion, or if the original 
prediction was "possibly out-of-date", "probably 
out-of-date" or "out-of-date" and new evidence 
appeared that changed the conclusions by a 
substantial amount. 

2. Fair: Concordance was considered fair if the original 
prediction was "still valid" and new evidence 
supported changes in some conclusions but not 
others; or if the original prediction was "possibly 
out-of-date" but no new evidence was incorporated into 
the updated conclusions and there were no substantive 
changes from the original conclusions; or if the 
original prediction was "probably out-of-date" or 
"out-of-date" and some conclusions or some aspects 
of the conclusions had changed but others had not. 

3. Poor: Concordance was considered poor if the 
original prediction was "still valid" but new evidence 
substantially changed the conclusions or if the 
original prediction was "probably out-of-date" or 
"out-of-date" but no new evidence was incorporated 
into the update and the conclusions underwent no 
substantive changes. 

Table 2 Comparative effectiveness reviews assessed^a 

| CER | 2009 prediction | Update commissioned |
|---|---|---|
| Comparative Effectiveness of Management Strategies for Gastroesophageal Reflux Disease [11] | | |
| Effectiveness of Noninvasive Diagnostic Tests for Breast Abnormalities [12] | | |
| Comparative Effectiveness of Epoetin and Darbepoetin for Managing Anemia in Patients Undergoing Cancer Treatment [13] | | |
| Comparative Effectiveness and Safety of Analgesics for Osteoarthritis [14] | | |
| Efficacy and Comparative Effectiveness of Off-Label Use of Atypical Antipsychotics [15] | | |
| Comparative Effectiveness of Drug Therapy for Rheumatoid Arthritis and Psoriatic Arthritis in Adults [16] | | |
| Comparative Effectiveness of Treatments to Prevent Fractures in Men and Women with Low Bone Density or Osteoporosis [17] | | |
| Comparative Effectiveness of Second-Generation Antidepressants in the Pharmacologic Treatment of Adult Depression [18] | | |
| Comparative Effectiveness of Angiotensin-Converting Enzyme Inhibitors (ACEIs) and Angiotensin II Receptor Antagonists (ARBs) for Treating Essential Hypertension [19] | | |
| Comparative Effectiveness of Therapies for Clinically Localized Prostate Cancer^b [20] | | |
| Comparative Effectiveness of Percutaneous Coronary Interventions and Coronary Artery Bypass Grafting for Coronary Artery Disease^c [21] | | |

^a AHRQ, Agency for Healthcare Research and Quality; CER, comparative effectiveness review. ^b Update not commissioned pending publication of the PIVOT trial. 
^c Update not commissioned or individual patient data meta-analysis had already been commissioned. 

Examples of the degree of concordance analysis are 
shown in Table 3. 
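The three concordance criteria can be read as a lookup from the 2009 prediction and the degree of change seen in the update. The sketch below is one way to write that down; it is not the authors' procedure, which relied on reviewer judgment. The three change levels (`"none"`, `"partial"`, `"substantial"`) and the function name are assumptions made for the example, and the combination of a "possibly out-of-date" prediction with a partial change is left undefined because the criteria do not cover it explicitly.

```python
from typing import Optional

# Hypothetical coding of what the update showed for a matched conclusion:
#   "none"        no new evidence, or new evidence that still supports it
#   "partial"     some conclusions or aspects changed, others did not
#   "substantial" new evidence changed the conclusion substantially
RULES = {
    ("still valid", "none"): "good",
    ("still valid", "partial"): "fair",
    ("still valid", "substantial"): "poor",
    ("possibly out-of-date", "substantial"): "good",
    ("possibly out-of-date", "none"): "fair",
    ("probably out-of-date", "substantial"): "good",
    ("probably out-of-date", "partial"): "fair",
    ("probably out-of-date", "none"): "poor",
    ("out-of-date", "substantial"): "good",
    ("out-of-date", "partial"): "fair",
    ("out-of-date", "none"): "poor",
}


def concordance(prediction: str, change: str) -> Optional[str]:
    """Return "good", "fair" or "poor", or None if the criteria are silent."""
    return RULES.get((prediction, change))
```

Under this reading, a "possibly out-of-date" prediction can never be rated poor: if the update shows no change, concordance is only downgraded to fair.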

We assessed "concordance" rather than "agreement" 
because the matching of the original conclusions to up- 
dated conclusions was often challenging, and "agreement" 
implies a more direct comparison of original to updated 
conclusions than is always possible. For this reason, we 
refrained from using a 2 x 2 table to make comparisons. 

We then made a summary assessment of the CER's 
priority for updating, based on the updated conclusions. 
We used the same criteria as those in the prospective 
assessment: how much of the report was out-of-date 
and the degree to which it was out-of-date. Using the κ 
statistic, we compared the agreement between the original 
assessment of priority and the actual changes. 

In the assessment of concordance of individual con- 
clusions, an additional complicating factor was the time 
delay between our limited literature searches to assess 
for signals (2008) and the search dates of the update 
reports (2010 to 2012). Therefore, for conclusions with 
poor concordance, we reviewed whether they may have 
been influenced by new evidence published after the 
surveillance signals search. 

Results 

We performed our assessment of predictive validity for 
nine CERs comprising 149 individual conclusions. For 
each CER, we present our assessment of the concordance 
of individual conclusions (Additional file 1) as well as a full 
table describing each conclusion and how it was assessed 
(Additional file 2). We also provide an overall table that 
sums up the individual conclusion assessments across all 
CERs (Table 4). 

The great majority (83%) of conclusions for each CER 
and across CERs had good concordance. However, the 
CER on gastroesophageal reflux disease (GERD) had four 
"out-of-date" conclusions with only fair concordance, and 
one conclusion we had assessed as "still valid" was shown 
to be out-of-date. 
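The summary percentages can be recomputed from the counts in Table 4. The short sketch below does so; the counts are transcribed from Table 4, while the function and variable names are chosen here for illustration.

```python
# Counts from Table 4: (good, fair, poor) for each 2009 prediction category.
TABLE4 = {
    "Still valid": (83, 1, 1),
    "Possibly out-of-date": (11, 16, 0),
    "Probably out-of-date": (7, 0, 0),
    "Out-of-date": (4, 4, 0),
}


def concordance_shares(table):
    """Overall share of matched conclusions rated good, and good or fair."""
    good = sum(g for g, _, _ in table.values())
    fair = sum(f for _, f, _ in table.values())
    poor = sum(p for _, _, p in table.values())
    matched = good + fair + poor
    return {
        "matched": matched,
        "good_pct": 100 * good / matched,
        "good_or_fair_pct": 100 * (good + fair) / matched,
    }


def good_share_by_prediction(table):
    """Percentage of conclusions rated good within each prediction category."""
    return {label: 100 * g / (g + f + p) for label, (g, f, p) in table.items()}


overall = concordance_shares(TABLE4)
by_prediction = good_share_by_prediction(TABLE4)
```

With these counts, 105 of 127 matched conclusions (83%) were rated good and 126 of 127 (99%) good or fair. Broken down by prediction, "still valid" conclusions were rated good in 83 of 85 cases, whereas "possibly out-of-date" conclusions were rated good in only 11 of 27.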

The published 2009 updating assessment judged that 
the conclusion regarding endoscopic treatment for GERD 
"should be deleted", meaning that it was out-of-date, because 
the endoscopic procedures had been withdrawn 
from the market. However, one of the three endoscopic 
procedures reviewed in the original report continued to be 
used, new endoscopic procedures were introduced and 
one of the two withdrawn procedures was later reintroduced. 
The update report noted this changing landscape, 
and we deemed the concordance with the 2009 prediction 
as only fair. A more appropriate surveillance assessment 
would have been that the conclusion needed updating because 
the endoscopic procedures were evolving over time. 

Another conclusion in the original GERD report — that 
surgery and medical therapy were similarly effective — 
was rated as "still valid" during the surveillance process 
but had poor concordance with the update review, which 
concluded that surgery was favored over medical therapy. 
One of the studies providing new evidence in support of 
this conclusion was published in 2009, after completion of 
the surveillance signal search. 

Table 5 compares our original predictions of the need 
for updating with the priority as determined by the actual 
update. One CER that was predicted in 2009 to be a high 
priority for updating was judged to have been a medium 
priority for updating based on the updated report. Conversely, 
one CER predicted in 2009 to be a medium priority for updating 
was judged to have been a high priority based on the updated 
report. The updating priority remained the same for the other seven 
CERs. Table 6 presents in a 3 x 3 table the results of the 
overall assessment of priority for updating. The κ statistic 
for agreement was 0.74 (Table 6). 
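The agreement statistic can be recomputed from the counts in Table 6. The text does not state whether the κ was weighted, so the sketch below computes both the unweighted and the linearly weighted versions. This is a reader's recomputation under that stated uncertainty, not the authors' code, and the function name is chosen here for illustration.

```python
# Table 6 counts: rows are 2009 predictions, columns are 2013 assessments,
# both ordered High, Medium, Low.
TABLE6 = [
    [3, 1, 0],
    [1, 2, 0],
    [0, 0, 2],
]


def cohen_kappa(matrix, weights="none"):
    """Cohen's kappa for a square table of counts.

    weights="none" gives the unweighted statistic; weights="linear" gives
    partial credit that falls off linearly with the distance between
    ordered categories.
    """
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]

    def agreement_weight(i, j):
        if weights == "none":
            return 1.0 if i == j else 0.0
        if weights == "linear":
            return 1.0 - abs(i - j) / (k - 1)
        raise ValueError(f"unknown weighting: {weights!r}")

    observed = sum(
        agreement_weight(i, j) * matrix[i][j]
        for i in range(k) for j in range(k)
    ) / n
    expected = sum(
        agreement_weight(i, j) * row_totals[i] * col_totals[j]
        for i in range(k) for j in range(k)
    ) / n ** 2
    return (observed - expected) / (1.0 - expected)


unweighted = cohen_kappa(TABLE6)          # 34/52, about 0.65
linear = cohen_kappa(TABLE6, "linear")    # 25/34, about 0.74
```

The linearly weighted value, 25/34, rounds to the reported 0.74, whereas the unweighted value is about 0.65. This suggests, but does not prove, that a linearly weighted κ was reported.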

Discussion 

This assessment of the predictive validity of a method 
to assess a CER for signals for updating yielded gener- 
ally favorable results. For the vast majority of individual 
conclusions, concordance between the 2009 predictions 
and the subsequent updated conclusions was judged to 
be good. The one instance of poor concordance had new 
evidence published after the surveillance signals had 
been assessed, and in this instance involved a CER already 
judged to be of high priority for updating based on signals 
of other out-of-date conclusions. 

Our present study has three primary limitations. The 
first is sample size. We were able to assess only nine 
CERs. However, this number included CERs assessed as 
being of high, medium or low priority, thus allowing us 
to assess the possibility of false-negatives (that is, CERs 
assessed as low priority for updating that nevertheless 
were fully updated). The likelihood of assessing such 
false-negatives again is small, as it requires that low- 
priority CERs be subjected to the gold standard of a full 
update. Our findings that neither of the CERs judged to 
be a low priority had any substantive changes in conclusions 
will reinforce the decision to invest scarce resources 
in researching other topics rather than commissioning 
updates of low-priority CERs. 

Table 3 Examples of degree of concordance between 2009 prediction and updated conclusion^a 

| Examples | Predictions and conclusions |
|---|---|
| Example 1 | |
| Original conclusion (from CER on analgesics for osteoarthritis) | No clear differences between various nonaspirin, nonselective NSAIDs or partially selective NSAIDs with regard to efficacy for pain relief or improvement |
| 2009 surveillance assessment [14] | Conclusion still valid |
| Conclusion from 2011 CER update [23] | No clear difference in efficacy for pain relief, or withdrawals due to lack of efficacy |
| Concordance | Good |
| Example 2 | |
| Original conclusion (from CER on analgesics for osteoarthritis) | Etoricoxib is associated with fewer gastrointestinal adverse events than nonselective NSAIDs |
| 2009 surveillance assessment [14] | Possibly out-of-date |
| Conclusion from 2011 CER update [23] | No comparable conclusion, as etoricoxib was not included because it did not gain FDA approval for sale in the United States |
| Concordance | Good |
| Example 3 | |
| Original conclusion (from CER on second-generation antidepressants) | Overall discontinuation rates did not differ significantly between SSRIs as a class and bupropion, mirtazapine, nefazodone, trazodone and venlafaxine. In the case of venlafaxine compared with SSRIs, higher discontinuation rates due to adverse events appeared to be balanced by lower discontinuation rates due to lack of efficacy. |
| 2009 surveillance assessment [16] | Conclusion is possibly out-of-date, and this portion may need updating based on new analysis showing lower dropout rate with escitalopram. |
| Conclusion from 2011 CER update [24] | Meta-analyses of numerous efficacy trials indicate that overall discontinuation rates are similar. Duloxetine and venlafaxine have a higher rate of discontinuations due to adverse events than SSRIs as a class. Venlafaxine has a lower rate of discontinuations due to lack of efficacy than SSRIs as a class. |
| Concordance | Fair: Escitalopram data did not end up in the conclusions |
| Example 4 | |
| Original conclusion (from CER on second-generation antidepressants) | Three head-to-head RCTs suggest that no substantial differences exist between fluoxetine and sertraline, fluvoxamine and sertraline, and trazodone and venlafaxine regarding relapse. Twenty-one placebo-controlled trials support the general efficacy and effectiveness of most second-generation antidepressants for preventing relapse or recurrence. No evidence exists for duloxetine. |
| 2009 surveillance assessment [16] | Conclusion is possibly out-of-date, and this portion of the CER may need updating to include evidence for duloxetine. |
| Conclusion from 2011 CER update [24] | On the basis of results of six efficacy trials and one naturalistic study, no significant differences exist between escitalopram and desvenlafaxine, escitalopram and paroxetine, fluoxetine and sertraline, fluoxetine and venlafaxine, fluvoxamine and sertraline, and trazodone and venlafaxine for preventing relapse or recurrence. |
| Concordance | Fair: No duloxetine evidence ended up being included with regard to this key question |
| Example 5 | |
| Original conclusion (from CER on management of GERD) | Medical therapy with PPIs and surgery (fundoplication) appeared to be similarly effective for improving symptoms and decreasing esophageal acid exposure. |
| 2009 surveillance assessment [18] | Conclusion is still valid, and this portion of the CER does not need updating. |
| Conclusion from 2011 CER update [25] | The 2005 CER concluded that medical therapy with PPIs and antireflux surgery were similarly effective in improving GERD-related symptoms and decreasing esophageal acid exposure, although some surgical patients required ongoing medical therapy postprocedure. With the addition of long-term follow-up data (7 to 12 years) from two previously reviewed studies and results from two new RCTs, our updated review found that patients who underwent antireflux surgery experienced a greater improvement in heartburn and regurgitation at follow-up than did patients who received medical treatment alone. |
| Concordance | Poor: Update indicates symptoms are better with surgery |

^a CER, comparative effectiveness review; FDA, US Food and Drug Administration; GERD, gastroesophageal reflux disease; NSAID, nonsteroidal anti-inflammatory 
drug; PPI, proton pump inhibitor; RCT, randomized controlled trial; SSRI, selective serotonin reuptake inhibitor. 



A second limitation is the matching of original conclusions 
to updated conclusions. In some updated reports, the 
authors themselves matched the conclusions. In most cases, 
however, this was not done, and, in some circumstances, 
determining the appropriate match to the original conclusion 
was challenging. Additional file 2 lists each original 
conclusion and its matching updated conclusion so that 
readers may judge this agreement for themselves. 

Table 4 Summary of concordance of predicted and actual 
conclusions across nine comparative effectiveness 
reviews^a 

| CER | Good | Fair | Poor | Total |
|---|---|---|---|---|
| Still valid | 83 | 1 | 1 | 85 |
| Possibly out-of-date | 11 | 16 | 0 | 27 |
| Probably out-of-date | 7 | 0 | 0 | 7 |
| Out-of-date | 4 | 4 | 0 | 8 |
| Total | 105 | 21 | 1 | 127 |

^a CER, comparative effectiveness review. Not applicable/no matching 
conclusions/new conclusions = 22. 



The third principal limitation of this study is that the 
2013 assessment of the 2009 predictions could not be 
made in a blinded fashion. Our Evidence-based Practice 
Center (EPC) did both assessments, and, even if some 
other group had done the 2013 assessment, we could 
not have enforced blinding, because the 2009 assessments 
are in the public domain. We tried to guard against bias 
by having explicit reasons for each judgment and presenting 
these reasons for readers themselves to judge. 
Our reasoning should be transparent. 

With the limitation of small sample size in mind, we 
offer the following preliminary conclusions about the 
surveillance signal method. (1) Low-priority CERs are 
unlikely to have any substantive changes in conclusions. 
(2) Conclusions judged likely to be "still valid" almost 
certainly are still valid. (3) Conclusions judged to be 
"out-of-date" almost certainly are out-of-date. (4) Safety 
concerns and the appearance of new classes of therapies 
and more efficacious treatments are the best targets for 
high-priority updates. (5) The classification of individual 
conclusions as possibly or probably out-of-date owing to 
new evidence may be slightly too sensitive as a signal; in a 
number of such instances, the update report's conclusion 
did not change, because the new evidence identified in the 
signal search was either rejected or insufficient to change 
the original conclusion. 

Table 5 Comparison of predicted vs. actual priority for updating^a 

| CER | 2009 prediction | End date of update search | 2013 assessment | Rationale |
|---|---|---|---|---|
| Comparative Effectiveness of Management Strategies for Gastroesophageal Reflux Disease [25] | High | August 2010 | High | Some procedures specifically mentioned in the Executive Summary have been withdrawn from the market. New procedures have been introduced. There is a major change in the conclusion about surgery vs. medical therapy. |
| Effectiveness of Noninvasive Diagnostic Tests for Breast Abnormalities [26] | High | September 2010 | Medium | The new data did not change the overall conclusions very much. The conclusion that MRI and ultrasound may be sufficient to evaluate lesions in women at low risk may be an important new conclusion. |
| Comparative Effectiveness of Epoetin and Darbepoetin for Managing Anemia in Patients Undergoing Cancer Treatment [27] | High | April 2012 | High | Major safety concerns leading to substantial changes in black box warnings and practice guidelines |
| Comparative Effectiveness and Safety of Analgesics for Osteoarthritis [23] | High | January 2011 | High | The updated Executive Summary specifically mentions a number of drugs that have been withdrawn because of safety concerns. |
| Efficacy and Comparative Effectiveness of Off-Label Use of Atypical Antipsychotics [28] | Medium | May 2011 | Medium | There are many new off-label indications and data on effectiveness, but these do not indicate strong effects of these drugs. |
| Comparative Effectiveness of Drug Therapy for Rheumatoid Arthritis and Psoriatic Arthritis in Adults [29] | Medium | February 2011 | High | New, expensive biologic DMARDs feature prominently in the Executive Summary of the updated report. |
| Comparative Effectiveness of Treatments to Prevent Fractures in Men and Women with Low Bone Density or Osteoporosis [30] | Medium | March 2011 | Medium | There are two new drugs: zoledronic acid and denosumab. However, there is no evidence that they are any more effective than existing drugs. There are signals of serious but rare new side effects, in particular subtrochanteric fractures of the hip, but they are not sufficient to change the initial decision to recommend antiresorptive therapy for women with osteoporosis. |
| Comparative Effectiveness of Second-Generation Antidepressants in the Pharmacologic Treatment of Adult Depression [24] | Low | January 2011 | Low | No substantive changes in conclusions |
| Comparative Effectiveness of Angiotensin-Converting Enzyme Inhibitors (ACEIs) and Angiotensin II Receptor Antagonists (ARBs) for Treating Essential Hypertension [31] | Low | December 2010 | Low | No substantive changes in conclusions |

^a DMARD, disease-modifying antirheumatic drug; MRI, magnetic resonance imaging. 

Table 6 Predictive validity of priority for updating a 
systematic review (2009 predicted priority)^a 

Columns give the priority based on actual changes in conclusions. 

| Priority | High, 2013 assessment | Medium, 2013 assessment | Low, 2013 assessment | Overall |
|---|---|---|---|---|
| High, 2009 prediction | 3 | 1 | 0 | 4 |
| Medium, 2009 prediction | 1 | 2 | 0 | 3 |
| Low, 2009 prediction | 0 | 0 | 2 | 2 |
| Total | 4 | 3 | 2 | 9 |

^a κ = 0.74. 

In sum, our assessment provides some support for the 
predictive validity of this method of assessing CERs for 
signals of the need for updating. Future research is likely 
to be confined to assessing updates of systematic reviews 
judged to be a medium or high priority for updating. 
Further assessment of the factors leading to changes in 
individual conclusions may help refine the criteria for 
distinguishing between high- and medium-priority update 
topics. However, investing extra time and effort to distin- 
guish "possibly" from "probably" out-of-date conclusions 
or to further refine the global assessment to distinguish 
medium- from high-priority update topics may begin to 
make the surveillance process resemble the actual update, 
which is not the goal of surveillance. In this application, 
the surveillance process worked very well, nearly perfectly 
in fact (κ > 0.8 is considered nearly perfect agreement). 
No low-priority CER was judged as having had a 
substantive change to a conclusion in the update, whereas 
3 of 4 high-priority CERs did have substantive changes to 
the conclusions. The results suggest that it is very unlikely 
that new, practice-changing evidence exists concerning a 
systematic review judged to be a low priority for updating 
and support a policy of delaying an update of a systematic 
review until new evidence is sufficient to warrant assigning 
it at least a medium priority. 

The assessment method described herein represents 
part of the basis for the surveillance method used to assess 
AHRQ systematic reviews as described by Ahmadzai et al. 
[8]. That program was designed to assess each AHRQ 
systematic review every 6 months and to take 3 months 
to complete. One important result is that no systematic 
review was judged to be a high priority for updating at 
the first 6-month assessment, meaning that it is probably 
more cost-effective to assess systematic reviews no more 
frequently than yearly. Additional work on making surveillance 
more cost-effective is warranted. 



Conclusion 

In our present study, we found evidence supporting the 
predictive validity of a method for assessing AHRQ sys- 
tematic reviews regarding their need for updating. One 
advantage of this method relative to other proposed 
methods is that it is equally useful for meta-analytic 
reviews and narrative reviews. It may be applicable to 
systematic reviews produced by other organizations. 